perm filename HOWTO[4,KMC]2 blob
sn#117252 filedate 1974-08-23 generic text, type T, neo UTF8
00100
00200
00300
00400 HOW TO MEASURE IMPROVEMENT OF A SIMULATION MODEL
00500 ALONG A DIMENSION OF LINGUISTIC COMPREHENSION
00600
00700 COLBY, HILF, WITTNER, PARKISON, FAUGHT
00800
00900
01000 To measure improvement one needs a scaled dimension and a
01100 value on that dimension to be striven for. In a previous
01200 communication (Colby and Hilf, 1974) a method was described for using
01300 judges to rate a paranoid simulation model's performance along a
01400 variety of dimensions. The judges consisted of randomly selected
01500 psychiatrists who rated transcripts of interviews conducted in
01600 natural language by other psychiatrists with paranoid patients and
01700 with versions of the model (PARRY1). The interviewers and the raters
01800 did not know that one of the interviewees was a computer simulation
01900 of paranoid processes.
02000 One of the rated dimensions was linguistic noncomprehension.
02100 (The negation "non" was used to keep the ratings consistent with
02200 other ratings being made at the same time.) A judge rated each I-O
02300 pair of an interview along this dimension on a scale of 0-9. The
02400 judges proved to be reliable [Frank- concordance scores here on this
02500 dimension]. The mean score received by the patients was 0.74 and by
02600 the model 2.22. The difference between the two mean ratings is
02700 significant at better than the 0.001 level.
02800 Close study of the reasons for this difference revealed that
02900 the model recognized topics in the natural language input but did not
03000 sufficiently recognize exactly what was being said about a topic. The
03100 pattern-recognition processes of the model failed to pick up
03200 sufficient information about a topic to give a reply indicating
03300 comprehension. The power of a pattern-matching approach in language
03400 recognition is the ability to ignore as irrelevant both what it
03500 recognizes and what it does not recognize at all. Its weakness lies
03600 in not having enough patterns to match the tremendous variety of
03700 expressions found in natural language dialogues.
03900 To improve the language-recognition processes of the model
04000 we designed several additional techniques which we shall only outline
04100 here. A complete description of them can be found in Colby, Parkison
04200 and Faught (1974).
04300 In brief, the language-recognizing module of the current
04400 paranoid model (PARRY2) progressively transforms the input until
04500 a pattern is achieved which completely or fuzzily matches a more
04600 abstract stored pattern. (See the flow diagram of Fig. 1). The
04700 input expression is first preprocessed by translating words and
04800 word groups (such as idioms) into internal synonyms which represent
04900 our names of word classes. Words not in the recognizer's dictionary
05000 are not included in the pattern being formed. Misspellings are
05100 corrected, groups of words are contracted into single words, and
05200 certain expansions are made (e.g. "dont" becomes "do not"). The
05300 pattern is then bracketed into shorter, more manageable units
05400 termed "segments". The resultant pattern is classified as "simple",
05500 containing no delimiters, or "complex", consisting of two or more
05600 simple patterns.
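
The preprocessing steps described above can be illustrated with a
minimal sketch in Python. It is not the PARRY2 implementation. The
dictionary entries, the spelling table, the choice of "AND" as a
delimiter, and the names preprocess and classify are all hypothetical
and were invented for this illustration. Only the "dont" expansion is
taken from the text.

```python
import re

# Hypothetical toy dictionary: surface word or idiom -> word-class name.
# None of these entries come from PARRY2.
SYNONYMS = {
    "you": "YOU", "are": "BE", "do": "DO", "not": "NOT",
    "like": "LIKE", "doctors": "DOCTOR", "afraid": "FEAR", "and": "AND",
    "how come": "WHY",               # an idiom contracted to one element
}
EXPANSIONS = {"dont": "do not"}          # the example given in the text
MISSPELLINGS = {"docters": "doctors"}    # hypothetical spelling table
DELIMITERS = {"AND"}                     # assumed segment boundary marker


def preprocess(text):
    """Turn raw input into a list of segments (lists of word-class names)."""
    words = []
    for w in re.findall(r"[a-z']+", text.lower()):
        w = MISSPELLINGS.get(w, w)               # correct misspellings
        words.extend(EXPANSIONS.get(w, w).split())  # expand contractions

    # Translate into internal synonyms, trying two-word idioms first.
    # Words that are not in the dictionary are left out of the pattern.
    pattern, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in SYNONYMS:
            pattern.append(SYNONYMS[pair])
            i += 2
        else:
            if words[i] in SYNONYMS:
                pattern.append(SYNONYMS[words[i]])
            i += 1

    # Bracket the pattern into segments at the delimiters.
    segments, current = [], []
    for element in pattern:
        if element in DELIMITERS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(element)
    if current:
        segments.append(current)
    return segments


def classify(segments):
    """'simple' if there is at most one segment, otherwise 'complex'."""
    return "simple" if len(segments) <= 1 else "complex"
```

In this sketch an input such as "How come you dont like docters?"
yields one segment and is classified as simple, while an input joined
by "and" yields two segments and is classified as complex.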
05700 The algorithm then attempts a complete match of the
05800 segments with stored simple patterns. When a match is found, the
05900 stored pattern points to the name of a response function in
06000 "memory" which decides what to do next. If a match is not found, a fuzzy
06100 match is tried by dropping elements in a segment one at a time
06200 and trying for a match each time. In the case of complex patterns
06300 this one-at-a-time dropping is carried out at the segment level. If
06400 these methods do not produce a match, a default condition obtains
06500 and the response module decides what to do.
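
The matching strategy of this paragraph can be sketched as follows.
This is an illustration under stated assumptions, not the PARRY2 code.
The stored patterns, the response-function names, and the function
names lookup and recognize are hypothetical. The sketch further
assumes that "one at a time" means each single deletion is tried in
turn from left to right, and that a complex pattern reduced to a
single segment is looked up among the simple patterns. The text does
not settle either point.

```python
# Hypothetical stored patterns -> names of response functions.
STORED_SIMPLE = {
    ("YOU", "LIKE", "DOCTOR"): "respond_about_doctors",
    ("WHY", "YOU", "FEAR"): "respond_about_fear",
}
STORED_COMPLEX = {
    (("YOU", "FEAR"), ("YOU", "LIKE", "DOCTOR")): "respond_fear_and_doctors",
}
DEFAULT = "default_response"


def lookup(pattern):
    """Look a pattern (tuple of segments) up in the table fitting its shape."""
    if len(pattern) == 1:                 # one segment: a simple pattern
        return STORED_SIMPLE.get(pattern[0])
    return STORED_COMPLEX.get(pattern)


def recognize(segments):
    """Return (response_name, kind); kind is 'complete', 'fuzzy' or 'default'."""
    pattern = tuple(tuple(s) for s in segments)

    # 1. Complete match.
    hit = lookup(pattern)
    if hit:
        return hit, "complete"

    # 2. Fuzzy match: drop one unit at a time and retry.
    if len(pattern) == 1:
        # Simple pattern: drop single elements of the segment.
        seg = pattern[0]
        candidates = [(seg[:i] + seg[i + 1:],) for i in range(len(seg))]
    else:
        # Complex pattern: drop whole segments.
        candidates = [pattern[:i] + pattern[i + 1:]
                      for i in range(len(pattern))]
    for candidate in candidates:
        hit = lookup(candidate)
        if hit:
            return hit, "fuzzy"

    # 3. No match: the default condition obtains.
    return DEFAULT, "default"
```

Under these assumptions a segment with one unrecognized extra element
still reaches the intended response function through the fuzzy path,
and an input matching nothing falls through to the default.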
06600 For this language-recognition strategy to be
06700 successful, a large number of words and word-combinations
06800 must be recognized and converted into patterns which match
06900 stored patterns. In the first experiment to be described, there
07000 were 1900 dictionary entries and about 2200 patterns, 1700 being
07100 simple and 500 complex.
07200
07300 EXPERIMENT 1
07400
07500 METHOD
07600
07700 Five clinicians interviewed both the old (PARRY1) and
07800 new (PARRY2) versions of the model without knowing which was which.
07900 All five agreed PARRY2 showed greater linguistic comprehension.
08000 To obtain a more precise estimate, 19 graduate students were
08100 paid to rate transcripts of these interviews. They rated each
08200 I-O pair of each interview along a dimension of "linguistic
08300 comprehension" ("Did the patient understand what the doctor
08400 said?") on a 0-9 scale.
08450
08500 RESULTS
08600
08700 In the 10 interviews there was a total of %%%% I-O pairs.
08800 On a 0-9 scale of linguistic comprehension, the mean rating of
08900 PARRY1 was 5.256 and the mean rating of PARRY2 was 5.483. This
09000 difference is significant at the 0.05 level (t=1.0935, one
09100 tailed test).
09200 These raters also rated transcripts of the original
09300 eight interviews conducted by psychiatrists with PARRY1 and
09400 with paranoid patients. PARRY1 received a mean rating of 5.19 and
09500 the patients 7.42. The difference is significant at the 0.001
09600 level. This confirms the original test using psychiatrists
09700 as raters. (Frank---how does it?)
09800 The student raters gave PARRY1 in the original interviews
09900 a mean rating of 5.19 and a mean rating of 5.26 in the experiment
10000 under discussion. The difference is not statistically significant
10100 (SD(difference)=0.1497, t=0.45, p<0.80). We can conclude the
10200 student raters are reliable and PARRY1 generates reliable
10300 ratings from two groups of raters.
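
As a rough check on the arithmetic, the t value can be recomputed
from the figures quoted in the text. The sketch assumes that t is the
difference between the two means divided by SD(difference). The
function name t_from_means is hypothetical. The value 5.256 is the
unrounded PARRY1 mean reported earlier in the RESULTS.

```python
def t_from_means(mean_a, mean_b, sd_difference):
    """t statistic, assuming t = (mean_a - mean_b) / SD(difference)."""
    return (mean_a - mean_b) / sd_difference


SD_DIFFERENCE = 0.1497        # as reported in the text
MEAN_ORIGINAL = 5.19          # PARRY1, original interviews

# Using the rounded mean 5.26 and the unrounded mean 5.256.
t_rounded = t_from_means(5.26, MEAN_ORIGINAL, SD_DIFFERENCE)
t_unrounded = t_from_means(5.256, MEAN_ORIGINAL, SD_DIFFERENCE)

print(round(t_rounded, 2), round(t_unrounded, 2))
```

The two results, about 0.47 and 0.44, bracket the reported t=0.45.
The small discrepancy is what one would expect from rounding of the
published means.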
10400
10500 DISCUSSION
10600
10700
10800 The improvement (movement towards the ratings received by
10900 patients) of PARRY2 over PARRY1 along the dimension of linguistic
11000 comprehension is statistically significant. However, PARRY2's rating
11100 of 5.48 is still distant from the rating of 7.42 received by the
11200 patients. How close should a simulation model come to its natural
11300 counterpart? Everybody knows that nobody knows. Perhaps we have
11400 reached the limit of approximation. Intuitively it seemed the model
11500 should be able to do better if we could pinpoint its most serious
11600 inadequacies.
11700 We looked at each I-O pair which received a mean rating
11800 of 5.0 or less. There were %%%% such cases. In %%% of these cases
11900 the pattern was recognized but, due to our own errors, the pointers
12000 pointed to the wrong response functions. In the %%% remaining cases,
12100 the pattern was not recognized. We corrected the pointers and then
12200 repeated the experiment using five different clinicians who interviewed
12300 PARRY1 and PARRY2.
12400
12500 EXPERIMENT 2